Bioinformatics A Practical Guide to Next Generation Sequencing Data Analysis (Hamid D. Ismail)


RGPL: This argument sets the PL field of the read group (RG), which holds the name of the technology used for sequencing. The value can be ILLUMINA, SOLID, LS454, HELICOS, or PACBIO.

RGLB: This argument sets the LB field of the read group, which holds the identifier of the DNA preparation library. The “MarkDuplicates” function of GATK uses this field to determine which read groups may contain duplicates of the same library molecules.

The following script creates the directory “RG” and uses Picard to add the read group to each BAM file in the “dedup” directory. We use the run ID as both the read group ID (RGID) and the sample name (RGSM).

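mkdir -p RG
cd dedup
# Loop over the deduplicated BAM files
for f in *.bam; do
    # Strip the ".bam" extension to get the run ID
    i=${f%.bam}
    # Add the read group; the run ID serves as both RG ID and sample name
    java -jar ~/software/picard.jar AddOrReplaceReadGroups \
        I="${i}.bam" \
        O="../RG/${i}.RG.bam" \
        RGID="${i}" \
        RGLB=lib RGPL=ILLUMINA \
        SORT_ORDER=coordinate \
        RGPU=bar1 RGSM="${i}"
    # Index the new BAM file
    samtools index "../RG/${i}.RG.bam"
done
cd ..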
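Before moving on, it can be useful to confirm that the read group fields were written as intended. The following is a minimal sketch, not part of the original workflow: the helper name rg_field and the example header line (for a hypothetical run called "sample1") are assumptions made for illustration.

```shell
# Hypothetical helper: print the value of one tag (e.g. PL, LB, SM)
# from tab-separated @RG header lines read on standard input.
rg_field() {
    awk -F '\t' -v tag="$1" '
        /^@RG/ {
            # Fields after "@RG" look like "PL:ILLUMINA"
            for (n = 2; n <= NF; n++)
                if (substr($n, 1, 3) == (tag ":"))
                    print substr($n, 4)
        }'
}

# Example @RG line for a hypothetical run named "sample1",
# using the same field values as the script above.
header=$(printf '@RG\tID:sample1\tLB:lib\tPL:ILLUMINA\tPU:bar1\tSM:sample1')

printf '%s\n' "$header" | rg_field PL
printf '%s\n' "$header" | rg_field SM
```

On a real file, the header can be piped in directly, for example with samtools view -H RG/sample1.RG.bam | rg_field PL, where the file name is again only a placeholder.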

4.2.2.2.9 Building a model for the BQSR

We already know that raw sequencing data may contain systematic errors that affect the reported base quality scores, causing them to be overestimated or underestimated. The quality of variant calling depends largely on the base quality scores, which also affect read alignment. To minimize the effect of systematic errors on variant calling, the GATK4 best practices include base quality score recalibration (BQSR). BQSR is a machine learning-based method that models the empirically observed errors and uses that model to adjust the quality scores of the aligned reads. The variant caller then uses the adjusted scores when deciding whether to call a variant. BQSR is performed in two steps: (i) building the recalibration table from the aligned reads and a set of known variants (with the GATK4 BaseRecalibrator function) and (ii) adjusting the base quality scores (with the GATK4 ApplyBQSR function). The first step generates a table that summarizes how often mismatches with the reference are observed for each combination of covariates, such as read group, reported quality score, machine cycle, and sequence context. The second step applies the recalibration and generates a new BAM file with recalibrated quality scores that the variant calling process can rely on. The known variants mark the sites of real variation, so that mismatches at those sites are not counted as sequencing errors. The model therefore requires high-quality variant datasets (SNPs and InDels) in VCF files downloaded from a reliable source such as the NCBI database. Human variant VCF files can also be downloaded from the GATK resource bundle, as mentioned above.
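The two steps can be sketched for a single sample as follows. This is an illustrative outline under stated assumptions rather than the exact commands of the workflow: the directory "BQSR", the reference and known-sites paths, and the helper names run and bqsr_sample are all hypothetical. By default the sketch only prints the commands (DRY_RUN=1), so it can be inspected without GATK installed.

```shell
# Set DRY_RUN=0 to execute the commands instead of printing them.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: print a command in dry-run mode, otherwise run it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Hypothetical helper: the two BQSR steps for one sample.
# Arguments: sample name, reference FASTA, known-variants VCF.
bqsr_sample() {
    sample="$1"; ref="$2"; known="$3"

    # Step (i): build the recalibration table from the read-group-tagged
    # BAM file, masking the known variant sites.
    run gatk BaseRecalibrator \
        -I "RG/${sample}.RG.bam" \
        -R "$ref" \
        --known-sites "$known" \
        -O "BQSR/${sample}.recal.table"

    # Step (ii): write a new BAM file with adjusted base quality scores.
    run gatk ApplyBQSR \
        -I "RG/${sample}.RG.bam" \
        -R "$ref" \
        --bqsr-recal-file "BQSR/${sample}.recal.table" \
        -O "BQSR/${sample}.bqsr.bam"
}

# Placeholder sample name and paths, for illustration only.
bqsr_sample sample1 ref/genome.fasta ref/known_sites.vcf.gz
```

Keeping the table and the recalibrated BAM file as separate outputs mirrors the two-step description: the table from step (i) can be inspected before step (ii) rewrites the quality scores. The output directory must exist before running with DRY_RUN=0.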